Robust Bilingual Word Alignment for Machine Aided Translation
Abstract
We have developed a new program called word_align for aligning parallel texts, that is, texts such as the Canadian Hansards that are available in two or more languages. The program takes the output of char_align (Church, 1993), a robust alternative to sentence-based alignment programs, and applies word-level constraints using a version of Brown et al.'s Model 2 (Brown et al., 1993), modified and extended to deal with robustness issues. Word_align was tested on a subset of Canadian Hansards supplied by Simard (Simard et al., 1992). The combination of word_align plus char_align reduces the variance (average square error) by a factor of 5 over char_align alone. More importantly, because word_align and char_align were designed to work robustly on texts that are smaller and noisier than the Hansards, it has been possible to successfully deploy the programs at AT&T Language Line Services, a commercial translation service, to help them with difficult terminology.
Similar resources
Improved Discriminative Bilingual Word Alignment
For many years, statistical machine translation relied on generative models to provide bilingual word alignments. In 2005, several independent efforts showed that discriminative models could be used to enhance or replace the standard generative approach. Building on this work, we demonstrate substantial improvement in word-alignment accuracy, partly through improved training methods, but predomi...
Statistical machine translation with cascaded probabilistic transducers
Statistical machine translation is based on the idea of extracting information from bilingual corpora, which can be used to generate new translations. The current work combines aspects of example-based machine translation and of grammar-based approaches, especially bilingual regular grammars, to develop a statistical translation system based on cascaded transducers. These transducers can be construc...
Bilingual Word Spectral Clustering for Statistical Machine Translation
In this paper, a variant of a spectral clustering algorithm is proposed for bilingual word clustering. The proposed algorithm efficiently generates two sets of clusters, one for each language, with high semantic correlation within monolingual clusters and high translation quality across the clusters between the two languages. Each cluster level translation is considered as a bilingual concept, whic...
Association-Based Bilingual Word Alignment
Bilingual word alignment forms the foundation of current work on statistical machine translation. Standard word-alignment methods involve the use of probabilistic generative models that are complex to implement and slow to train. In this paper we show that it is possible to approach the alignment accuracy of the standard models using algorithms that are much faster, and in some ways simpler, bas...
Bilingual Multi-Word Term Tokenization for Chinese–Japanese Patent Translation
We propose to re-tokenize data with aligned bilingual multi-word terms to improve statistical machine translation (SMT) in technical domains. To do so, we independently extract multi-word terms from the monolingual parts of the training data. Promising bilingual multi-word terms are then identified using the sampling-based alignment method by setting a threshold on translation probabilities....
BiTAM: Bilingual Topic AdMixture Models for Word Alignment
We propose a novel bilingual topical admixture (BiTAM) formalism for word alignment in statistical machine translation. Under this formalism, the parallel sentence-pairs within a document-pair are assumed to constitute a mixture of hidden topics; each word-pair follows a topic-specific bilingual translation model. Three BiTAM models are proposed to capture topic sharing at different levels of l...